Abstract. Online service providers gather increasingly large amounts of personal data into user profiles and monetize
them with advertisers and data brokers. Users have little control of what information is processed and face an
all-or-nothing decision between receiving free services or refusing to be profiled. This paper explores an alternative 
approach where users only disclose an aggregate model - the "gist" - of their data. The goal is to preserve data 
utility and simultaneously provide user privacy. We show that this approach is practical and can be realized by let- 
ting users contribute encrypted and differentially-private data to an aggregator. The aggregator combines encrypted 
contributions and can only extract an aggregate model of the underlying data. In order to dynamically assess the 
value of data aggregates, we use an information-theoretic measure to compute the amount of "valuable" information 
provided to advertisers and data brokers. 

We evaluate our framework on an anonymized dataset of 100,000 U.S. users obtained from the U.S. Census Bureau
and show that (i) it provides accurate aggregates with as few as 100 users, (ii) it generates revenue for both users
and data brokers, and (iii) its computational overhead is low.



1 Introduction 

The digital footprint of Internet users is growing at an unprecedented pace, boosted not only by the 
increasing number of activities performed online, but also by the billions of posts, likes, check-ins, and 
multimedia content shared every day. This creates invaluable sources of information that can be used to
profile users and serve behaviorally targeted advertisements. With $8 billion in annual revenue in 2013, Facebook
is the prime example of a company successfully monetizing personal data with advertisers and data brokers. 

This economic model, however, raises major privacy concerns [11,16,40,42], as advertisers might excessively
track users, data brokers might illegally market consumer profiles [48], and governments might abuse
their surveillance power [17,18] by obtaining datasets collected for monetization purposes. Consequently,
consumer advocacy groups have pushed for policies and legislation [30,41] that give users greater control and
make collection practices more transparent (e.g., the EU cookie law).

Along these lines, several efforts — such as OpenPDS, personal.com, Sellbox, and Handshake — advocate 
a novel, user-centric paradigm: users store their personal information in "data vaults", and directly manage 
with whom to share their data. This approach has several advantages: on one hand, users maintain data 
ownership (and may monetize their data); on the other hand, data brokers and advertisers benefit from more 
accurate and detailed personal information [26,44]. Nevertheless, privacy remains a challenge, as users
need to trust data vault operators and relinquish their profiles to advertisers [7,38].



* Work done, in part, while authors were at PARC. 



To address these concerns, the research community proposes to maintain data vaults on user devices and 
share data in a privacy-preserving way. Existing solutions can be grouped into three categories: methods 
that (1) run advertising locally without revealing any information to advertisers/data brokers [19,28,43]; 
(2) rely on a trusted third party to anonymize user data [4,35]; and (3) rely on a trusted third party for
private user data aggregation [2,9,10]. Unfortunately, these approaches suffer from several limitations which 
hinder their adoption. Localized methods prevent data brokers and advertisers from obtaining user statistics. 
Anonymization techniques provide advertisers with significantly reduced data utility and are prone to re- 
identification attacks [29]. Finally, existing private aggregation schemes rely on a trusted third party for 
differential privacy (e.g., a proxy [10], a website [2], or mixes [9]). Also, aggregation occurs after decryption, 
thus making it possible to link contributions and users. 

Motivated by the above challenges, we propose a novel approach to the privacy-preserving monetization
of user data. Rather than contributing data as-is, users combine their data into an aggregate model - the 
"gist." Intuitively, users contribute encrypted and differentially-private data to an aggregator that extracts a 
statistical model of the underlying data (e.g., probability density function of the age of contributing users). 
Our approach addresses issues with existing work in that it does not depend on a third-party for differen- 
tial privacy, incurs low computational overhead, and addresses linkability issues between contributions and 
users. Moreover, we propose a metric to dynamically value user statistics according to their inherent amount 
of "valuable" information (i.e., sensitivity): for instance, aggregators can assess whether age statistics in a 
group of participants are more sensitive than income statistics. To the best of our knowledge, our solution 
provides the first privacy-preserving aggregation scheme for personal data monetization.

Our contributions can be summarized as follows: 

1. We design a privacy-preserving framework for monetizing user data. Users trade an aggregate of their 
data instead of actual values. 

2. We define a measure of the sensitivity of different data aggregates. In particular, we adopt the information- 
theoretic Jensen-Shannon divergence [24] to quantify the distance between the actual distribution of a 
data attribute, and a distribution that does not reveal actionable information [15], such as the uniform 
distribution. 

3. We show how to rank aggregates based on their sensitivity, i.e., we design a dynamic valuation scheme 
based on how much information an aggregate leaks. 

We evaluate our privacy-preserving framework on a real, anonymized dataset of 100,000 US users (obtained
from the Census Bureau) with different types of attributes. Our results show that our framework (i)
provides accurate aggregates with as few as 100 participants, (ii) generates revenue for users and data
aggregators depending on the number of contributing users and sensitivity of attributes, and (iii) has low 
computational overhead on user devices (0.3 ms for each user, independently of the number of participants). 
Interestingly, we find that data brokers have an incentive to direct their investments on small groups of 
users representative of a certain population. In summary, our approach provides a novel perspective to the 
privacy-preserving monetization of personal data, and finds a successful balance between data accuracy for 
advertisers, privacy protection for users, and incentives for data aggregators. 

Paper Organization. The rest of the paper is organized as follows. The next section introduces the system
architecture and the problem statement. Then, Section 3 presents our framework and Section 4 reports on 
our experimental evaluation. After reviewing related work in Section 5, the paper concludes in Section 6. 

2 System Architecture 

This section introduces the problem definition and presents participating entities. 

Fig. 1: System architecture and basic protocol. Users contribute encrypted profiles to the aggregator. The
aggregator combines encrypted profiles and obtains plaintext data models, which it monetizes with customers.

2.1 Problem Statement 

We consider a system comprising three entities: a set of users U = {1, . . . , N}, a data aggregator A,
and a customer C. The system architecture is illustrated in Fig. 1. Customers query the data aggregator for 
user information, while users contribute their personal information to the data aggregator. The aggregator 
acts as a proxy between users and customers by aggregating (and monetizing) user data. The main goal of 
this paper is to propose practical techniques for the aggregator to aggregate and monetize users' data in a 
privacy-preserving way, i.e., without revealing personal information to other users or third parties. 

2.2 System Model 

Users. We assume that users store a set of personal attributes such as age, gender, and preferences locally.
Each user $i \in U$ maintains a profile vector $p_i = [x_{i1}, \ldots, x_{iK}]$, where $x_{ij} \in V$ is the value of attribute
$j$ and $V$ is a suitable domain for $j$. For example, if $j$ represents the age of user $i$, then $x_{ij} \in \{1, \ldots, M_j\}$,
$M_j = 120$, and $V \subset \mathbb{N}$.

In practice, users can generate their personal profiles manually, or leveraging profiles maintained by 
third parties. Several social networks allow subscribers to download their online profile. A Facebook profile, 
for example, contains numerous Personally Identifiable Information (PII) items (such as age, gender, rela- 
tionships, location), preferences (movies, music, books, tv shows, brands), media (photos and videos) and 
social interaction data (list of friends, wall posts, liked items). 

Following the results of recent studies on user privacy attitudes [3,7,26], we assume that each user $i$ can
specify a privacy-sensitivity value $0 < \lambda_{ij} < 1$ for each attribute $j$. A large $\lambda_{ij}$ indicates high privacy
sensitivity (i.e., lower willingness to disclose). In practice, $\lambda_{ij}$ can assume a limited number of discrete
values, which could represent the different levels of sensitivity according to Westin's Privacy Indexes [22].

We also assume that users want to monetize their profiles while preserving their privacy. For instance, 
users may be willing to trade an aggregate of their online behavior, such as the frequency at which they visit 
different categories of websites, rather than the exact time and URLs. 

Finally, we assume that user devices can perform cryptographic operations consisting of multiplications, 
exponentiations, and discrete logarithms. 

Data Aggregator. A data aggregator A is an untrusted third-party that performs the following actions: (1) it 
collects encrypted attributes from users, (2) it aggregates contributed attributes in a privacy-preserving way, 
and (3) it monetizes users' aggregates according to the amount of "valuable" information that each attribute 
conveys. 

We assume that users and A sign an agreement upon user registration that authorizes A to access only 
the aggregated results (but not users' actual attributes), to monetize them with customers, and to take a share 
of the revenue from the sale. It also binds A to redistribute the rest of the revenue among contributing users. 

Customer. We consider a customer C that wants to obtain aggregate information about users and is willing
to pay for it. C can have commercial contracts with multiple data brokers. Similarly, a data aggregator can
have contracts with multiple customers. (Without loss of generality, we consider one customer and one
aggregator to ease presentation.) C interacts with a data aggregator A but not directly with users. C obtains
available attributes, and initiates an aggregation by querying the data aggregator for specific attributes. 

2.3 Applications 

We argue that the proposed system model is well-suited to many real-world scenarios, including mar- 
ket research and online tracking use cases. For instance, consider a car dealer C that wants to assess user 
preferences for car brands, their demographics, and income distributions. A data aggregator A might collect 
aggregate information about a representative set of users U and monetize it with the car dealer C. Compa- 
nies such as Acxiom currently provide this service, but raise privacy concerns [39]. Our solution enables 
such companies to collect aggregates of personal data instead of actual values and reward users for their 
participation. 

Another example is that of an online publisher (e.g., a news website) C that wishes to know more about 
its online readers [2]. In this case, the aggregator A is an online advertiser that collects information about 
online users U and monetizes it with online publishers. 

Finally, our proposed model can also be appealing to data aggregators in healthcare [12]. Healthcare 
data is often fragmented in silos across different organizations and/or individuals. A healthcare aggregator
A can compile data from various sources and allow third parties C to buy access to the data. At the same 
time, data contributors (U) receive a fraction of the revenue. Our approach thwarts privacy concerns and 
helps with the pricing of contributed data. 

2.4 Threat Model 

In modeling security, we consider both passive and active adversaries. 

Passive adversaries. Semi-honest (or honest-but-curious) passive adversaries monitor user communications 
and try to infer the individual contributions made by other users. For instance, users may wish to obtain 
attribute values of other users; similarly, data aggregators and customers may try to learn the values of the 
attributes from aggregated results. A passive adversary executes the protocol correctly and in the correct 
order, without interfering with inputs or manipulating the final result. 

Active adversaries. Active (or malicious) adversaries can deviate from the intended execution of the pro- 
tocol by inserting, modifying or erasing input or output data. For instance, a subset of malicious users may 
collude with each other in order to obtain information about other (honest) users or to bias the result of the 
aggregation. To achieve their goal, malicious users may also collude with either the data aggregator or with 
the customer. Moreover, a malicious data aggregator may collude with a customer in order to obtain private 
information about the user attributes. 

3 Monetizing User Profiles with Privacy 

We outline and formalize the data monetization framework, which consists of a protocol that is executed 
between users U, a data aggregator A and a customer C. We first provide an intuitive description and then 
detail each individual component. 

3.1 High-Level Description

We propose a protocol where users trade personal attributes in a privacy-preserving way, in exchange for
(possibly monetary) remuneration. There are two possible modes of implementation: interactive
and batch.

In interactive mode, a customer initiates a query about specific attributes and users. The aggregator se- 
lects the users matching the query, collects encrypted replies, computes aggregates, and monetizes them 
according to a pricing function. In batch mode, users send their encrypted profile, containing personal at- 
tributes, to the data broker. The aggregator combines encrypted profiles, decrypts them, obtains aggregates 
for each attribute, and ranks attributes based on the amount of "valuable" information they provide. A cus- 
tomer is then offered access to specific attributes. Without loss of generality, hereafter we describe the 
interactive mode. 

Initialization: The data aggregator A and users $i \in U$ engage in a secure key establishment protocol to
obtain individual random secret keys $s_i$, where $s_0$ is only known to A and $s_i$ ($\forall i \in U$) is only known to
user $i$, such that $s_0 + s_1 + \ldots + s_N = 0$ (this condition is required for the aggregation of the data that
will be described later). Any secure key establishment protocol or trusted dealer can be used in this phase
to distribute the secret keys, as long as the condition on their sum is respected. The initialization phase
is the same as in [36]. Each user $i$ generates its profile vector $p_i \in V^K$ containing personal attributes
$j \in \{1, \ldots, K\}$.

1. Customer Query: A customer queries the aggregator. The query contains information about the type of 
aggregates and users. In practice, it could be formatted as an SQL query. 

2. User Selection: The aggregator selects users based on the customer query. To do so, we consider that 
users shared some basic information with the aggregator, such as their demographics. Another option is 
for the aggregator to forward the customer query to users, and let users decide whether to participate or 
not. 

3. Aggregator Query: The aggregator forwards the customer's query to the users, together with a public
feature extraction function $f$.

4. Feature Extraction: Each user $i$ can optionally execute a public feature extraction function
$f : V^K \to O^L$ on $p_i$, where $L$ is the dimension of the output feature space $O$, thus resulting in a feature vector $f_i$.
The goal of feature extraction is to enable the aggregation of user features instead of raw user attributes, 
e.g., a feature could capture how a user contributes to social networks. 

5. Encryption and Obfuscation: Each user adds a random noise value to $f_i$, obtaining $\hat{f}_i$, and encrypts it.
Encryption and obfuscation provide strong guarantees both in terms of data confidentiality and differential
privacy [13]. Each user sends the encrypted vector $E(\hat{f}_i)$ to A.

6. Aggregation, Decryption, and Pricing: A combines all $E(\hat{f}_i)$ and decrypts the result, generating one 2-tuple
$\{V_j, W_j\}$ for each attribute $j$. These tuples are used to approximate the probability density function
of attributes across users. A uses $\{V_j, W_j\}$ to create a discrete sampled probability distribution function
$d\mathcal{N}_j$ for each attribute $j$. A then computes a distance measure $d_j = d(d\mathcal{N}_j, d\mathcal{U}_j) \in [0, 1]$ between $d\mathcal{N}_j$
and $d\mathcal{U}_j$, where $d\mathcal{U}_j$ is a discrete uniform distribution in the interval $[m_j, M_j]$. A small/large distance
corresponds to an attribute with low/high information "value", as described later in the text.

A determines the cost Cost(j) of each attribute j by taking into account the distance $d_j$, the number
of contributing users, and the price per attribute. We describe the pricing scheme in the Appendix.

7. Answer: A sends a set of 2-tuples $\{(d_{\rho_z}, \mathrm{Cost}(\rho_z))\}_{z=1}^{K}$ to C, who decides which aggregates to
purchase. After the purchase, A obtains a share of the total sale revenue and distributes the remainder equally
to users.






3.2 Detailed Description 

We detail the functions and primitives for the aggregation and monetization of user data. In this paper, 
we compute aggregates by estimating the probability density function (pdf) of user attributes. We use the
Gaussian approximation to estimate pdfs for two reasons. First, existing work shows that this will lead to
precise aggregates with few users. The central limit theorem (CLT) [23,34] states that the arithmetic mean of a
sufficiently large number of independent random variables, drawn from distributions of expected value $\mu$ and
variance $\sigma^2$, will be approximately normally distributed, $\mathcal{N}(\mu, \sigma^2)$. Second, the Gaussian pdf is fully defined by these
two parameters and thus we do not need additional coordination among users (after the initialization phase). 
For information leakage ranking, we use a well-established information-theoretic distance function. 

For conciseness, we focus on the description of privacy-preserving aggregation and pricing (phases 
4 to 6, i.e., feature extraction, encryption, aggregation and ranking). The pricing part is described in the 
Appendix. With respect to the initialization and query forwarding phases (1-3), our method is general enough 
and can be adapted to any specific implementation. 



Phase 4-5: Feature Extraction and Encryption. Each user $i$ generates a profile vector $p_i = [x_{i1}, \ldots, x_{iK}]$.
Each attribute $j$ takes value $x_{ij} \in \{m_j, \ldots, M_j\}$, where $m_j, M_j \in \mathbb{Z}_p$ are the minimum and maximum
value. Note that computations are in the cyclic group $\mathbb{Z}_p$, where $p$ is its prime order. Recall that in practice, a
user can derive $p_i$ either from an existing online profile (e.g., Facebook) or by manually entering values $x_{ij}$.
In our evaluation, we use values from the U.S. Census Bureau [45,46]. 

To guarantee $(\epsilon, \delta)$-differential privacy, each user $i$ adds noise $r_{ij}, o_{ij}$ to attribute values, sampled from
a symmetric Geometric distribution according to Algorithm 1 in [36]. In particular, in the following we add
noise to both $x_{ij}$ and $x_{ij}^2$, as they will be subsequently combined to obliviously compute the parameters of
the model that underlies the actual data:

where $p$ is the prime order [36].
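
As a rough illustration of this noise step, the following Python sketch samples two-sided (symmetric) geometric noise as the difference of two geometric variables and adds it to an attribute value and to its square, modulo $p$. The parameter `alpha`, the helper names, and the toy modulus are assumptions made for this example only. The sketch uses the same `alpha` for both values for brevity and does not reproduce the remaining details of Algorithm 1 in [36].

```python
import math
import random

def geometric(q, rng):
    """Number of failures before the first success, with P(G >= k) = q**k."""
    u = 1.0 - rng.random()                      # uniform in (0, 1]
    return int(math.floor(math.log(u) / math.log(q)))

def symmetric_geometric(alpha, rng):
    """Two-sided geometric noise with P(k) proportional to alpha**(-abs(k)), alpha > 1."""
    q = 1.0 / alpha
    return geometric(q, rng) - geometric(q, rng)

def add_noise(x, alpha, p, rng):
    """Return noisy versions of x and x**2, both reduced modulo p."""
    r = symmetric_geometric(alpha, rng)         # noise for the value
    o = symmetric_geometric(alpha, rng)         # noise for the squared value
    return (x + r) % p, (x * x + o) % p

rng = random.Random(0)
noisy_x, noisy_x2 = add_noise(30, alpha=2.0, p=2039, rng=rng)
```

A larger `alpha` concentrates the noise around zero, which trades privacy for accuracy of the resulting aggregates.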

With the noisy values of $x_{ij}$ and $x_{ij}^2$, each user generates the following encrypted vectors $(c_i, b_i)$:

Each user $i$ then sends $(c_i, b_i)$ to A. Note that the encryption scheme guarantees that A is unable to decrypt
the vectors $(c_i, b_i)$. However, thanks to its own secret share $s_0$, A can decrypt aggregates as explained
hereafter.



Phase 6: Privacy-Preserving Aggregation and Pricing. To compute the sample mean $\hat{\mu}_j$ and variance $\hat{\sigma}_j^2$
without having access to the individual values $x_{ij}$, $x_{ij}^2$ of any user $i$, A first computes the intermediate
values:

To obtain $(\hat{\mu}_j, \hat{\sigma}_j^2)$, A takes the discrete logarithm base $g$ of $\{V_j, W_j\}$:

Finally, using the derived $(\hat{\mu}_j, \hat{\sigma}_j^2)$, A computes the Gaussian pdf for each of the K attributes.
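
The aggregation idea can be sketched end to end in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: it assumes ciphertexts of the form $g^{x} h^{s} \bmod P$ with zero-sum shares, uses a tiny safe prime instead of a 1024-bit modulus, omits the differential-privacy noise, recovers sums by brute-force discrete logarithm, and derives the variance from the standard identity $\sigma^2 = E[x^2] - E[x]^2$. All constants and function names are hypothetical.

```python
import random

P = 2039            # toy safe prime, P = 2*Q + 1 (the paper uses a 1024-bit modulus)
Q = 1019            # prime order of the subgroup of squares modulo P
G = 4               # generator of that order-Q subgroup
H_C, H_B = 9, 25    # public subgroup elements, one per ciphertext vector (assumed)

def zero_sum_shares(n, rng):
    """Return (s_0, [s_1..s_n]) with s_0 + s_1 + ... + s_n = 0 (mod Q)."""
    shares = [rng.randrange(Q) for _ in range(n)]
    return (-sum(shares)) % Q, shares

def encrypt(value, share, h):
    """Assumed ciphertext form: g^value * h^share mod P."""
    return (pow(G, value, P) * pow(h, share, P)) % P

def aggregate(ciphertexts, s0, h):
    """Multiply all ciphertexts with h^s0; the shares cancel, leaving g^(sum of values)."""
    acc = pow(h, s0, P)
    for c in ciphertexts:
        acc = (acc * c) % P
    return acc

def discrete_log(target):
    """Brute-force log base G; feasible only because the sums are small."""
    acc = 1
    for exponent in range(Q):
        if acc == target:
            return exponent
        acc = (acc * G) % P
    raise ValueError("target is not in the subgroup generated by G")

def private_mean_variance(values, rng):
    """The aggregator learns only the two sums, never an individual value."""
    n = len(values)
    s0, shares = zero_sum_shares(n, rng)
    c = [encrypt(x, s, H_C) for x, s in zip(values, shares)]
    b = [encrypt(x * x, s, H_B) for x, s in zip(values, shares)]
    sum_x = discrete_log(aggregate(c, s0, H_C))      # sum of the values
    sum_x2 = discrete_log(aggregate(b, s0, H_B))     # sum of the squared values
    mean = sum_x / n
    return mean, sum_x2 / n - mean ** 2

mu, var = private_mean_variance([3, 5, 4, 6], random.Random(1))
print(mu, var)  # 4.5 1.25
```

Because the shares sum to zero modulo the group order, the product of all ciphertexts with the aggregator's own term reveals only $g$ raised to the sum, which is why a single ciphertext stays opaque to A while the aggregate does not.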

In order to estimate the amount of "valuable" information (i.e., sensitivity) that each attribute leaks, we
propose to measure the distance between $\mathcal{N}_j$ and the uniform distribution $\mathcal{U}$ (which does not leak any
information [20]). A related concept was studied in [15,21] for measuring the "interestingness" of textual data
by comparing it to an expected model, usually with the Kullback-Leibler (KL) divergence. To the best of our
knowledge, we are the first to explore this approach in the context of information privacy. Instead of the KL
divergence, we rely on the Jensen-Shannon (JS) divergence for two reasons: (1) JS is symmetric and (2) it is
bounded, whereas the KL divergence is neither. It is defined as:



$$JS(u, q) = \frac{1}{2} KL(u, m) + \frac{1}{2} KL(q, m) = H\left(\frac{1}{2}u + \frac{1}{2}q\right) - \frac{1}{2}H(u) - \frac{1}{2}H(q)$$



where $m = u/2 + q/2$ and $H$ is the Shannon entropy. As JS is in [0, 1] (when using the logarithm base
2), it quantifies the relative distance between $\mathcal{N}_j$ and $\mathcal{U}_j$, and also provides absolute comparisons with
distributions different from the uniform.
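
A minimal Python sketch of the entropy form of this definition, using base-2 logarithms so that the result lies in [0, 1]. The function names are chosen for this example; the paper's own implementation relies on the MALLET package instead.

```python
import math

def entropy(dist):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def js_divergence(u, q):
    """JS(u, q) = H((u + q) / 2) - H(u) / 2 - H(q) / 2 for two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(u, q)]
    return entropy(m) - entropy(u) / 2 - entropy(q) / 2

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.70, 0.10, 0.10, 0.10]
leakage = js_divergence(peaked, uniform)   # strictly between 0 and 1
```

Identical distributions give a divergence of 0, and distributions with disjoint support give 1, which is what makes the measure convenient for ranking attributes on a common scale.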

As JS operates on discrete values, A must first discretize distributions $\mathcal{N}_j$ and $\mathcal{U}_j$. Given the knowledge
of intervals $\{m_j, \ldots, M_j\}$ for each attribute $j$, we can use Riemann's centered sum to approximate a definite
integral, where the number of approximation bins is related to the accuracy of the approximation. We choose
the number of bins to be $M_j - m_j$, and thus guarantee a bin width of 1. We approximate $\mathcal{N}_j$ by the discrete
random variable $d\mathcal{N}_j$ with the following mass function:



$$\Pr(d\mathcal{N}_j) = \big(\ldots, \Pr(x_j), \ldots\big)^T, \qquad \Pr(x_j) = \mathrm{pdf}_j(\text{center of the unit-width bin for } x_j)$$

where $\mathrm{pdf}_j$ is the probability density function of $\mathcal{N}_j$ and $x_j \in \{m_j, \ldots, M_j\}$. For the uniform distribution
$\mathcal{U}_j$, the discretization to $d\mathcal{U}_j$ is straightforward, i.e., $\Pr(d\mathcal{U}_j) = (1/(M_j - m_j), \ldots, 1/(M_j - m_j))^T$, where
$\dim(d\mathcal{U}_j) = K$.
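
The discretization step can be sketched as follows. The exact bin-center convention is an assumption of this example: it uses $M_j - m_j$ unit-width bins whose centers are $m_j + k + 0.5$, and it rescales the Gaussian masses to sum to one, in the spirit of the scaling described in the evaluation. The function names and the numeric parameters are illustrative only.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def discretize_gaussian(mu, var, m, M):
    """Midpoint (centered Riemann) discretization over M - m unit-width bins."""
    centers = [m + k + 0.5 for k in range(M - m)]    # assumed bin centers
    weights = [gaussian_pdf(c, mu, var) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]              # rescaled to sum to one

def discretize_uniform(m, M):
    """Discrete uniform distribution over the same M - m bins."""
    return [1.0 / (M - m)] * (M - m)

d_gauss = discretize_gaussian(mu=10.0, var=8.0, m=1, M=16)
d_unif = discretize_uniform(1, 16)
```

Both vectors have the same length, so they can be passed directly to a discrete divergence such as JS.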

A can now compute distances $d_j = JS(d\mathcal{N}_j, d\mathcal{U}_j) \in [0, 1]$ and rank attributes in increasing order of
information leakage such that $d_{\rho_1} \le d_{\rho_2} \le \ldots \le d_{\rho_K}$, where $\rho_1 = \arg\min_j d_j$ and $\rho_z$ (for $2 \le z \le K$) are
defined as $\rho_z = \arg\min_{j \ne \rho_1, \ldots, \rho_{z-1}} d_j$.

At this point, A has computed the 3-tuple $(d_{\rho_j}, \hat{\mu}_j, \hat{\sigma}_j^2)$ for each attribute $j$. Each user $i$ can now decide
whether it is comfortable sharing attribute $j$ given distance $d_j$ and privacy sensitivity $\lambda_{ij}$. To do so, each
user $i$ sends $\lambda_{ij}$ to A for comparison. A then checks which users are willing to share each attribute $j$
and updates the ratio $\gamma_j = S_j/N$, where $S_j$ is the number of users that are comfortable sharing, i.e.,
$S_j = |\{i \in U \text{ s.t. } d_j < 1 - \lambda_{ij}\}|$. In practice, A could then use the majority rule to decide whether or
not to monetize attribute $j$.
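
The ranking and consent check described above can be sketched in Python as follows. The attribute names, distances, and sensitivity values are hypothetical inputs, and the majority threshold of one half is one possible reading of "majority rule".

```python
def rank_by_leakage(distances):
    """Attribute names sorted by increasing distance d_j (least revealing first)."""
    return sorted(distances, key=distances.get)

def sharing_ratio(d_j, sensitivities):
    """Fraction of users with d_j < 1 - lambda_ij, i.e. S_j / N."""
    willing = sum(1 for lam in sensitivities if d_j < 1 - lam)
    return willing / len(sensitivities)

def monetizable(distances, sensitivities):
    """Majority rule: monetize attribute j only if more than half of the users agree."""
    return {j: sharing_ratio(distances[j], sensitivities[j]) > 0.5
            for j in rank_by_leakage(distances)}

# Hypothetical distances and per-user sensitivities for three attributes.
distances = {"income": 0.2, "education": 0.4, "age": 0.1}
sensitivities = {
    "income": [0.9, 0.85, 0.3],
    "education": [0.2, 0.5, 0.95],
    "age": [0.2, 0.5, 0.95],
}
decision = monetizable(distances, sensitivities)
```

In this example income is withheld because two of the three users declare a sensitivity too high for its distance, while education and age pass the majority check.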

After this ranking phase, the data broker A concludes the process with the pricing and revenue phases 
described in the Appendix. 

4 Evaluation

To test the relevance and the practicality of our privacy-preserving monetization solution, we measure
the quality of aggregates, the overhead, and generated revenue. In particular, we study how the number of 
protocol participants and their privacy sensitivities affect the accuracy of the Gaussian approximations, the 
computational performance, the amount of information leaked for each attribute, and revenue. 

4.1 Setup 

We consider secret shares in $\mathbb{Z}_p$, where $p$ is a 1024-bit modulus, the number of users $N \in [10, 100000]$,
and each user $i$ with profile $p_i$. We implemented our privacy-preserving protocol in Java, and rely on public
libraries for secret key initialization and multi-threaded decryption, and on the MALLET [27] package for
computation of the JS divergence.

We run our experiments on a machine equipped with Mac OSX 10.8.3, dual-core Core i5 processor, 2.53 
GHz, and 8 GB RAM. Measurements up to 100 users are averaged over 300 iterations, and the rest (from
1k to 100k users) are averaged over 3 iterations due to large simulation times.

We populate user profiles with U.S. Census Bureau information [45,46]: We obtained anonymized of- 
fline and online attributes about 100,000 people. We pre-processed the acquired data by removing incom- 
plete profiles (i.e., some respondents prefer not to reveal specific attributes). 

Without loss of generality, we focus on three types of offline attributes: yearly income level, education
level, and age. We selected these attributes because (1) a recent study [7] shows that these attributes
have high monetary value (and thus privacy sensitivity), and (2) they have significantly different distributions
across users. This allows us to compare remuneration models, and measure the accuracy of the Gaussian
approximation for a variety of distributions.

Table 1 shows the mean and variance for the three considered attributes with a varying number
of users. Note that the provided values for income and education use a specific scale defined by the Census
Bureau. For example, values of 1 and 16 for education correspond to "Less than 1st grade" and "Doctorate",
respectively.

We could consider other types of attributes as well, such as internet, music and video preferences from 
alternative sources, such as Yahoo Webscope [47]. Although an exhaustive comparison of the monetization 
of all different attributes is an exciting perspective, it is out of the scope of this paper and we leave this for 
future work. 

4.2 Results 

We evaluate four aspects of our privacy-preserving scheme: model accuracy, information leakage, over- 
head and pricing. 

Model Accuracy. In our proposal, we approximate empirical probability density functions with Gaussian 
distributions. The accuracy of approximations is important to assess the relevance of derived data models. In 
Fig. 2, we compare the actual distribution of each attribute with its respective Gaussian approximation and
vary the number of users from 100 to 100,000. Note that in order to compare probabilities over the domain
$[m_j, M_j]$, we scaled both the actual distribution and the Gaussian approximation such that their respective
sums over that domain are equal to one. We observe that, visually, the Gaussian approximation captures 
general trends in the actual data. 

                              Number of randomly selected users in the dataset
Attribute   Statistic         10         100        1k         10k        50k        100k
Income      mean              6.50       9.72       10.30      10.87      10.83      10.89
            variance          19.17      18.70      20.04      17.05      16.72      16.52
            (m_j, M_j)        (1,10)     (1,15)     (1,15)     (1,16)     (1,16)     (1,16)
Education   mean              5.70       7.23       10.29      10.38      10.21      10.18
            variance          15.57      7.07       7.96       7.68       7.73       7.63
            (m_j, M_j)        (1,9)      (1,12)     (1,15)     (1,16)     (1,16)     (1,16)
Age         mean              38.10      35.40      41.91      42.44      41.49      39.79
            variance          252.54     502.79     563.32     546.40     553.68     539.60
            (m_j, M_j)        (11,67)    (1,85)     (0,85)     (0,85)     (0,85)     (0,85)


Table 1: Summary of U.S. Census dataset used for the evaluation: we considered three types of attributes 
(income, education, and age), which reflect different types of sample distributions (as shown in Fig. 2). 



We measure the accuracy of the Gaussian approximation in more detail with the JS divergence (Fig.
3a). We observe that with 100 users, the approximation reaches a plateau for education, whereas income
and age require 1k users to converge. For the two latter attributes, the approximation accuracy triples when
increasing from 100 to 1k users. Moreover, as the number of users increases, the fit of the Gaussian model
for income and age is two times better (JS of 0.05 bits) than for education (JS of 0.1 bits). The main reason
is that education has more data points with large differences between actual and approximated distributions
than income and age (as shown in Fig. 2).

These results indicate that, for non-uniform distributions, the Gaussian approximation is accurate with a 
relatively small number of users (about 100). It is interesting to study this result in light of the Central Limit 
Theorem (CLT). Remember that the CLT states that the arithmetic mean of a sufficiently large number of 
variables will tend to be normally distributed. In other words, a Gaussian approximation quickly converges 
to the original distribution and this confirms the validity of our experiments. This also means that C can 
obtain accurate models even if it requests aggregates about small groups of users. In other words, collecting 
data about more than 1k users does not significantly improve the accuracy of approximations, even for more
extreme distributions. 

Information Leakage. We compare the divergence between Gaussian approximations and uniform dis- 
tributions to measure the information leakage of different attributes. Fig. 3b shows the sensitivity for each 
attribute with a varying number of users. We observe that the amount of information leakage stabilizes 
for all attributes after a given number of participants. In particular, education and age reach a maximum
information leakage with 1k users, whereas 10k users are required for income to achieve the same leakage.

Overall, we observe that education is by far the attribute with the largest distance to the uniform distribution,
and therefore arguably the most valuable one. In comparison, income and age are 50% and 75% less
"revealing". Information leakage for age decreases from 100 to 1k users, as the age distribution in our dataset
tends towards a uniform distribution. In contrast, education and income are significantly different from a
uniform distribution. An important observation is that the amount of valuable information does not increase
monotonically with the number of users: for age, it decreases by 30% when the number of users increases
from 100 to 1k, and for education it decreases by 3% when transitioning from 10k to 50k users.

These findings show that larger user samples do not necessarily provide better discriminating features. 
This also shows that users should not decide whether to participate in our protocol solely based on a fixed 
threshold over total participants, as this may prove to leak slightly more private information. 

(b) Attribute education, sampled from 100 users (left), 1k users (middle) and 100k users (right).

(c) Attribute age, sampled from 100 users (left), 1k users (middle) and 100k users (right).

Fig. 2: Gaussian approximation vs. actual distribution for each considered attribute.



Overhead. We measure the computation overhead for both users and the data broker. For each user, we find 
that one execution of the protocol requires 0.284 ms (excluding communication delays), out of which 0.01 
ms are spent for the profile generation, 0.024 ms for the feature extraction, 0.026 ms for the differential- 
privacy noise addition, and 0.224 ms for encryption of the noisy attribute. In general, user profiles are not 
subject to change within short time intervals, thus suggesting that user-side operations could be executed on 
resource-constrained devices such as mobile phones. 
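
As an illustration of the noise-addition step timed above, the sketch below samples two-sided (symmetric) geometric noise and adds it to an integer attribute value. It is a simplified sketch under stated assumptions: the helper names are hypothetical, the parameterization alpha = exp(epsilon / sensitivity) is the common form of the geometric mechanism rather than a setting taken from this paper, and any scheme-specific details of how users add noise in [36] are omitted.

```python
import math
import random


def geometric(p, rng):
    """Number of failures before the first success, with success probability p."""
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    return int(math.floor(math.log(u) / math.log(1.0 - p)))


def symmetric_geometric(alpha, rng):
    """Two-sided geometric noise with Pr[k] proportional to alpha ** -abs(k), alpha > 1."""
    p = 1.0 - 1.0 / alpha
    # The difference of two i.i.d. geometric variables is two-sided geometric.
    return geometric(p, rng) - geometric(p, rng)


def noisy_value(x, epsilon, sensitivity, rng):
    """Perturb an integer attribute value (assumed parameterization, see lead-in)."""
    alpha = math.exp(epsilon / sensitivity)
    return x + symmetric_geometric(alpha, rng)


rng = random.Random(42)
print(noisy_value(35, epsilon=0.5, sensitivity=1, rng=rng))
```

The sampler needs only a handful of floating-point operations, in line with the small share of user-side time reported for this step.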

From Fig. 3c, observe that the data broker requires about one second to complete its phases when there 
are only 10 users, 1.5 min with 100 users, 15 min with 1k users, and 27.7 h for 100k users. Note, however, 
that running times can be substantially reduced using algorithmic optimization and parallelization, which is 
part of our future work. In our results, decryption is the most time-consuming operation for the data broker, 
as it incurs $O(N \cdot M_j)$ operations; this could be reduced to $O(\sqrt{N \cdot M_j})$ by using Pollard's Rho method for 
computing the discrete logarithm [33]. Decryption can also be sped up by splitting decryption operations 
across multiple machines (i.e., the underlying algorithm is highly parallelizable). 
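
The gap between linear and square-root decryption cost can be sketched as follows. The example recovers a bounded exponent (the aggregate sum) from a group element in two ways: exhaustive search, and baby-step giant-step. Baby-step giant-step is used here as a simpler deterministic stand-in with the same square-root complexity as the Pollard method mentioned above; it is not the method of [33]. The toy group (P = 10007, G = 4, subgroup order 5003) and the function names are assumptions chosen for readability and are far too small for real use.

```python
from math import isqrt

P = 10007  # toy prime, P = 2 * 5003 + 1
G = 4      # generates the subgroup of prime order 5003


def dlog_linear(h, bound):
    """Try every exponent 0..bound: O(bound) group operations."""
    acc = 1
    for x in range(bound + 1):
        if acc == h:
            return x
        acc = acc * G % P
    return None


def dlog_bsgs(h, bound):
    """Baby-step giant-step: O(sqrt(bound)) time and memory."""
    m = isqrt(bound) + 1
    baby = {}
    acc = 1
    for j in range(m):          # baby steps: store G^j -> j
        baby.setdefault(acc, j)
        acc = acc * G % P
    step = pow(G, -m, P)        # G^(-m) mod P (needs Python 3.8+)
    gamma = h
    for i in range(m + 1):      # giant steps: h * G^(-i*m)
        if gamma in baby:
            return i * m + baby[gamma]
        gamma = gamma * step % P
    return None


total = 3210                    # hypothetical aggregate to recover
h = pow(G, total, P)
print(dlog_linear(h, 5000), dlog_bsgs(h, 5000))
```

Both searches are also easy to parallelize, since the exponent range can be split into independent intervals.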
(a) Divergence between the Gaussian approximation and the actual distribution of each attribute j, computed as 
$JS(\mathcal{N}_j, \mathrm{Actual}_j)$. Lower values indicate better accuracy. 
(b) Information leakage for each type of attribute j (income, education and age), defined as 
$JS(\mathcal{N}_j, \mathcal{U}_j)$. Lower values indicate smaller information leaks. 
(c) Performance measurements for each of the four phases of the protocol performed by the data broker. 
(d) Relative revenue (per attribute) for each user $i \in U$ and the data aggregator A, assuming that an 
attribute is valued at 1. 

Fig. 3: Results of the evaluation of the proposed framework on the U.S. Census dataset. 



Pricing. Recall that the price of an attribute aggregate depends on the number of contributing users, the 
amount of information leakage, and the cost of the attribute. We consider that each attribute j has a unit 
cost of 1 and the data broker takes a commission $\omega_j$. Details about the revenue sharing model and pricing 
are discussed in the Appendix. We consider three types of privacy sensitivities $\lambda$: (i) a uniform random 
distribution of privacy sensitivities $\lambda_{ij}$ for each user i and for each attribute j, (ii) an individual privacy 
sensitivity $\lambda_i$ for each user (same across different attributes), and (iii) an all-share scenario ($\lambda_i = 0$ and all 
users contribute). The commission percentage is set to $\omega_j = \omega = 0.1$. 

Fig. 3d shows the average revenue generated from one attribute by the data broker and by users. We 
observe that user revenue is small and does not increase with the number of participants. In contrast, the 
data broker revenue increases linearly with the number of participants. In terms of privacy sensitivities, we 
observe that with higher privacy sensitivities ($\lambda_i > 0$), fewer users contribute, thus generating lower revenue 
overall and per user. For example, users start earning revenue with 10 participants in the all-share scenario, 
but more users are required to start generating revenue if users adopt higher privacy sensitivities. 

We observe that users are incentivized to participate as they earn some revenue (rather than not 
benefiting at all), but the revenue per user is too small to constitute significant income; it might therefore 
encourage participation from "biased" demographics (e.g., similar to Amazon Mechanical Turk). In contrast, the data 
broker has incentives to attract more users, as its revenue increases with the number of participants. However, 
customers are incentivized to select fewer users because cost increases with the number of users, and 100 
users provide as good an aggregate as 1000 users. This is an intriguing result, as it encourages customers to 
focus on small groups of users representative of a certain population category. 

4.3 Security 

Passive adversary. To ensure privacy of the personal user attributes, our framework relies on the security of 
the underlying encryption and differential-privacy methods presented in [36]. Hence, no passive adversary 
(a user participating in the monetization protocol, the data aggregator or an external party not involved in 
the protocol) can learn any of the user attributes, assuming that the key setup phase has been performed 
correctly and that a suitable algebraic group (satisfying the DDH assumption) with a large enough prime 
order (1024 bits or more) has been chosen. 

Active adversary. As per [36], our framework is resistant to collusion attacks among users and between 
a subset of users and the data broker, as each user i encrypts its attribute values with a unique and secret 
key. However, pollution attacks, which try to manipulate the aggregated result by encrypting out-of-scope 
values, can affect the aggregate result of our protocol. Nevertheless, such attacks can be mitigated 
by including, in addition to encryption, range checks based on efficient (non-interactive) zero-knowledge 
proofs of knowledge [5,6,25]: each user could submit, in addition to the encrypted values, a proof that such 
values are indeed in the plausible range specified by the data aggregator. However, even within a specific 
range, a user can manipulate its contributed value and thus affect the aggregate. Although nudging users to 
reveal their true attribute value is an important challenge, it is outside of the scope of this paper. 

5 Related Work 

Our work builds upon two main domains, in order to provide the privacy and incentives for the users and 
data aggregators: (1) privacy-preserving aggregation [14,36,37,49], and (2) privacy-preserving monetization 
of user profiles [4,19,35,43]. Hereafter we discuss these two sets of works. 

5.1 Privacy-Preserving Aggregation 

Erkin and Tsudik [14] design a method to perform privacy-preserving data aggregation in the smart grid. 
Smart meters jointly establish secret keys without having to rely on a trusted third party, and mask individual 
readings using a modified version of the Paillier encryption scheme [32]. The aggregator then computes the 
sum of all readings without seeing individual values. Smart meters must communicate with each other, thus 
limiting this proposal to online settings. Shi et al. [37] compute the sum of different inputs based on data 
slicing and mixing with other users, but have the same limitation: all participants must actively communicate 
with each other during the aggregation. 

Another line of work [9,10] introduces privacy-preserving aggregation by combining homomorphic en- 
cryption and differential privacy, i.e., users encrypt their data with the customer public key and send it to a 
trusted aggregator. The aggregator adds differential noise to encrypted values (using the homomorphic 
property), and forwards the result to the customer. The customer decrypts contributions and computes desired 
aggregates. These proposals suffer, however, from a number of shortcomings as (i) they rely on a trusted 
third party for differential privacy; (ii) they require at least one public key operation per single bit of user 
input, and one kilobit of data per single bit of user answer, or rely on XOR encryption; and (iii) contributions 
are linkable to users as aggregation occurs after decryption. 

Shi et al. [36] propose the first method to compute the sum of different inputs in a privacy-preserving 
fashion, without requiring communication among users, nor repeated interactions with a third party. The 
proposed scheme also provides differential privacy guarantees in presence of malicious users, and establishes 
an upper bound on the error induced by the additive noise. This work formally shows that a Geometric 
distribution provides $(\epsilon, \delta)$-differential privacy (DD) in $\mathbb{Z}_p$. We extend the cryptographic construct of [36] to 
support the privacy-preserving computation of probability distributions (in addition to sums). Intuitively, we 
use the proposed technique to compute the parameters of Gaussian approximations in a privacy-preserving 
way. As we maintain the same security assumptions, our framework preserves provable privacy properties. 
We intend to explore, as part of future work, the possibilities and properties of regression modeling and 
privacy-preserving computation of regression parameters [1,49], in addition to distributions. 

5.2 Privacy-Preserving Monetization 

Previous work investigated two main approaches to privacy-preserving Online Behavioral Advertise- 
ment (OBA). The first approach minimizes the data shared with third parties, by introducing local user 
profile generation, categorization, and ad selection [2,19,28,43]. The second approach relies on anonymiz- 
ing proxies to shield users' behavioral data from third parties - until users agree to sell their data [4,35]. 

Toubiana et al. [43] propose to let users maintain browsing profiles on their device and match ads with 
user profiles, based on a cosine-similarity measure between the metadata of visited websites (title, URL, tags) and 
ad categories. Users receive a large number of ads, select appropriate ones, and share selected ads with ad 
providers (not revealing visited websites nor user details). Guha et al. [19] propose to do the ad matching 
with an anonymization proxy instead. Although the cost of such a system is estimated at $0.01/user/year, 
such a solution demands significant changes from web browser vendors and online advertisers. Akkus et al. 
propose to let users rely on the website publisher to anonymize their browsing patterns vis-a-vis the ad- 
provider. Their protocol introduces significant overhead: The website publisher must repeatedly interact 
with each visitor and forward encrypted messages to the ad-provider. 

Instead of local profiles, Riederer et al. [35] propose a fully centralized approach, where an anonymiza- 
tion proxy mediates interactions between users and website publishers. The proxy releases the mapping 
between IP addresses and long-term user identifiers only after users agree to sell their data to a customer, 
thus allowing the customer to link different visits by the same users. However, users have to entrust a third 
party with their personal information. 

In contrast, our framework does not rely on any additional user-side software, does not impose compu- 
tationally expensive cryptographic computation on user devices, and prevents the customer from learning 
individual user data. 

6 Conclusion 

As the amount and sensitivity of personal data handed over to service providers increases, so do privacy 
concerns. Users usually have little control of what information is processed by service providers and how it is 
monetized with advertisers. This work proposes a privacy-preserving alternative where users only disclose 
an aggregate model of their profiles, by means of encrypted and differentially private contributions. Our 
solution tackles trust and incentive challenges: rather than selling data as-is, users trade a model of their data. 
Users monetize their profiles by dynamically assessing the value of data aggregates. We use an information- 
theoretic measure to compute the amount of valuable information provided to advertisers. 

We evaluate our framework on a real and anonymized dataset with more than 100,000 users (obtained 
from the U.S. Census Bureau) and show, with an experimental evaluation, that our solution (i) provides 
accurate aggregates with as few as 100 users, (ii) introduces low overhead for both users (less than 1 ms on 
commodity hardware) and data aggregators, and (iii) generates revenue for both users and aggregators. 

As part of future work, we plan to enhance our scheme with new features. Fault-tolerant aggregation [8] 
can be integrated in order to allow users to join/leave dynamically without disrupting the scheme. Also, 
range checks for the encrypted user attributes, based on efficient zero-knowledge proofs, could thwart active 
pollution attacks. Finally, we intend to investigate schemes for targeting ads to users contributing data to the 
aggregation, by allowing the aggregator to select specific subgroups of users according to the customer's 
target population. 
A Appendix 

A.1 Pricing of User Attributes 

Prior work shows that users assign unique monetary value to different types of attributes depending on 
several factors, such as offline/online activities [7], type of third-parties involved [7], privacy sensitivity [3], 
amount of details and fairness [26]. 

We measure the value of aggregates depending on their sensitivity, the number of contributing users, 
and the cost of each attribute. Without loss of generality, we estimate the value of an aggregate j using the 
following linear model: 

$$\mathrm{Cost}(j) = \mathrm{Price}(j) \cdot d_j \cdot N$$

where Price(j) is the monetary value that users assign to attribute j, $d_j$ is the information leakage of the 
aggregate, and N is the number of contributing users. Without loss of generality, we assume 
in our pricing scheme a relative value of 1 for each attribute. Existing work discussed the value of user 
attributes, and estimated a large range from $0.0005 to $33 [7,31], highlighting the difficulty in determining 
a fixed price. In practice, this is likely to change depending on the monetization scenario. 

A then sends the set of 2-tuples $\{(d_{p_z}, \mathrm{Cost}(p_z))\}_{z \geq 1}$ to C. Based on the tuples, C selects the set P of 
attributes it wishes to purchase. After the purchase is complete, A re-distributes revenue R among users and 
itself, according to the agreement stipulated with the users upon their first registration with A. 

We consider a standard revenue sharing monetization scheme, where the revenue is split among users 
and the data aggregator (i.e., aggregator takes commissions): 

$$R(A) = \sum_{j \in P} \omega_j \cdot \mathrm{Cost}(j), \qquad R(i) = \frac{1}{N} \sum_{j \in P} (1 - \omega_j) \cdot \mathrm{Cost}(j), \quad \forall i \in U$$

where $\omega_j$ is the commission percentage of A. This system is popular in existing aggregating schemes [12], 
credit-card payments, and online stores (e.g., iOS App Store). We assume a fixed $\omega_j$ for each attribute j. 
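
As a worked illustration of the pricing and revenue-sharing model, the sketch below evaluates the linear cost model and splits the resulting revenue between the aggregator and the users. It assumes the split reads R(A) = sum of omega_j * Cost(j) and R(i) = (1/N) * sum of (1 - omega_j) * Cost(j) over the purchased attributes; the function names and the leakage value 0.2 are hypothetical and not results from the evaluation.

```python
def cost(price, leakage, n_users):
    """Linear pricing model: Cost(j) = Price(j) * d_j * N."""
    return price * leakage * n_users


def split_revenue(costs, commissions, n_users):
    """Return (aggregator revenue, revenue per user) for the purchased attributes."""
    aggregator = sum(w * c for w, c in zip(commissions, costs))
    per_user = sum((1 - w) * c for w, c in zip(commissions, costs)) / n_users
    return aggregator, per_user


# One attribute with unit price, hypothetical leakage 0.2, commission 0.1.
for n in (100, 1000):
    c = cost(price=1.0, leakage=0.2, n_users=n)
    print(n, split_revenue([c], [0.1], n))
```

Under these assumptions the aggregator's share grows linearly with the number of users while each user's share stays constant for a fixed leakage, which is consistent with the trend described for Fig. 3d.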